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We consider discrete nonparametric priors which induce Gibbs- 
type exchangeable random partitions and investigate their posterior 
(N ■ behavior in detail. In particular, we deduce conditional distributions 

and the corresponding Bayesian nonparametric estimators, which can 
be readily exploited for predicting various features of additional sam- 
ples. The results provide useful tools for genomic applications where 
prediction of future outcomes is required. 



1. Introduction. Random partitions and their associated probability dis- 
tributions play an important role in a variety of research areas. In population 
genetics, for example, models for random partitions are useful in order to de- 
scribe the allocation of a sample of n genes into a number of distinct alleles. 
See, for example, [10, 33]. In machine learning theory, probabilistic models 
\ for linguistic applications (such as, e.g., speech and handwriting recognition, 

00 ■ machine translation) are often based on a suitable clustering structure for a 

set of words. See, for example, [34, 35]. In Bayesian nonparametric inference, 
a discrete nonparametric prior is commonly employed in complex hierarchi- 
, cal mixture models and it induces an exchangeable random partition for the 

\ latent variables: this provides an effective tool for inferring on the cluster- 

ing structure of the observations. Such an approach is due to [21] and has 
been extended in various directions. See, for example, [12, 13, 19, 22]. Other 
^ ' important areas of applications include storage problems, excursion theory, 
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combinatorics and statistical physics. See the comprehensive and stimulating 
monograph by Pitman [29] and references therein. 

An early and well-known model which describes the grouping of n objects 
into k distinct classes is due to [7] and leads to the Ewens sampling formula. 
The basic assumption is that individuals are sequentially sampled from an 
infinite set of different species and the proportion pi with which the ith 
species is present in the population is random. Then, if (Wfc)fc>i is a sequence 
of independent and identically distributed random variables with Beta(l,#) 
distribution, the random proportions are defined as 

j-i 

(1) Pi = W u P j = W j ]J(l-W k ) Vj>2. 

fc=i 

Now, if Xi, . . . , X n is a sample of n individuals drawn from the population, 
set M n := (Mi jn , . . . , M n ^ n ) where Mj iTl is the number of species represented 
j times in the sample of size n. Hence, the distribution of M n is supported by 
all those vectors m n , = (mi. n , . . . , m n , n ) for which J2i=i ^ m i,n = n - The Ewens 
sampling formula provides the probability distribution of the random vector 
M n under (1) and it coincides with 

where (8) n = 6(0 + 1) • ■ • (0 + n — 1) for any 6 > 0. We also agree on setting 
(0)o := 1. See also [2] for a derivation of (2). Obviously, to the distribu- 
tion of M n there corresponds a distribution of the vector (i^ n ,N n ) where 
K n is the number of distinct species detected among the n observations 
in the sample and N n = (Ni jn , . . . , NK n , n ) is the vector of frequencies with 
which each distinct species is observed. Such a correspondence is one-to- 
one and, conditional on K n , the distribution of N n , is supported on the set 
&n,K n ■= {{ni,. .. ,n Kn ) G {1,... ,n} Kn :J2f=i n j = n }- In particular, for the 
Ewens sampling formula (2) there corresponds the probability distribution 

(3) Pv[K n = k, N„ = (m, . . . , n K J = ^- J] (n 3 - 1) ! 

j=1 

for any k € {1, . . . ,n} and (ni, . . . , n^) € A n ^. The parameter 6, in genetic 
applications, is interpreted as the mutation rate of each gene into new allelic 
types. Formula (3) has a further interesting combinatorial interpretation. 
If 6 is a positive integer, then ^ fc fjj=i( n i — 1)' is the number of colored 
permutations of {1, . . . ,re} into k cycles with respective lengths ni, . . . ,rik, 
each cycle being labeled by any of the 9 available colors. Accordingly, (3) 
is the probability distribution of a random permutation with colored cycles. 
See [3, 29] for exhaustive accounts on the Ewens sampling formula. 
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The distribution of the vector (K n , N n ) takes on the name of exchangeable 
partition probability function (EPPF), a notion introduced by Pitman in [26] 
and further studied in a series of subsequent papers; see [29] and references 
therein. The main object of investigation of the present paper is a family of 
EPPFs, introduced and thoroughly investigated in [9], which generalize the 
Ewens sampling scheme. Our aim is to establish distributional properties 
of such EPPFs which allow, given a sample, to make predictions according 
to a Bayesian nonparametric procedure. The concrete motivation for this 
study is provided by the straightforward applicability of the results to in- 
ference in genetic experiments. As a matter of fact, an important setting 
where our findings can be usefully applied relates to gene detection in ex- 
pressed sequence tags (EST) experiments. ESTs are produced by sequencing 
randomly selected cDNA clones from a cDNA library. Given an initial EST 
data set of size n, one is interested in the prediction of the outcomes of fur- 
ther sampling from the library. For instance, interest lies in the estimation 
of the number of new unique genes in a possible additional sample of size 
m: nonparametric frequentist estimators, however, yield completely unsta- 
ble estimates when m > 2n. See [25] for a discussion of this phenomenon. In 
contrast, for the corresponding Bayesian nonparametric estimators proposed 
in [20], and based on Gibbs partitions, the relative dimension of m with re- 
spect to n is not an issue. Indeed, we will show that the EPPF, whenever 
analytically available, yields straightforward and coherent answers to this 
and other related prediction problems. 

In Section 2 we recall the concepts of exchangeable random partition and 
EPPF and the definition of the class of exchangeable Gibbs random par- 
titions. In Section 3 we derive distributional results for the corresponding 
EPPFs conditionally on a sample: we obtain expressions for the predictive 
distribution of future observations given the past, then focus on the prob- 
ability distribution of the random partition restricted to those observations 
yielding new distinct species in the future sample and, finally, face the prob- 
lem of determining the probability that specific observed species will not 
appear in the future sample. In Section 4 we illustrate how our results can 
be applied in the context of EST analysis of cDNA libraries. The Appendix 
contains a short review of generalized factorial coefficients and the proofs. 

2. Exchangeable Gibbs random partitions. A random partition of the 
set of natural numbers N is defined as a consistent sequence II = {Hn}^^ 
of random elements, with n n taking values in the set of all partitions of 
[n] := {1, . . . , n} into some number of disjoint nonempty blocks. Consistency 
in this setting implies that each U n is obtained from n n+ i by discarding 
the integer n + 1. A random partition II is exchangeable if, for each n, 
the probability distribution of LI n is invariant under all permutations of 
(1, . . . ,re). To be more precise, let {Aj}j =1 denote a partition of the set [re], 
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and let the Aj 's be indexed by [A;] in order of their least elements. In order 
to describe the property of exchangeability for II let us introduce a sequence 
of functions n£ : A n fc — > R + such that: 

(i) n«(i) = i ; 

(ii) for any (m, . . . ,n k ) € A nj &, k 6 {1, . . . , n} and n > 1 one has 

nj. n) (m , . . . , nfc) = n| n) (n p{1) , . . . , n p{k) ) 

where p is an arbitrary permutation of the indices (1, . . . , k); 

(hi) for any (ni, . . . , n k ) 6 A nj fc, fe € {1, . . . ,n} and n > 1 the following 
addition rule holds true: 

Iljp (rai,...,7ifc) 

(4) 

fc 

= H n i n+1) (ni, . . . , nj + 1, . . . , nfc) + nj™_+ 1} (m, . . . , n k , 1). 

i=i 

A function ni with these properties is known as an exchangeable partition 
probability function (EPPF) and it uniquely determines the probability law 
of an exchangeable random partition according to the equality 

p(n n = {A u . . .,A k }) = n< n) (|Ai[, . . . , \A k \), 

where \A\ stands for the cardinality of set A. A first treatment of this con- 
cept can be found in [26], and a recent exhaustive account on exchangeable 
random partitions is provided in [29] . The above-mentioned Ewens sampling 
formula corresponds to the EPPF of the Dirichlet process [8] as described in 
(3) and it has found many interesting applications, for instance, in Bayesian 
nonparametrics and in population genetics. Another noteworthy example is 
represented by Pitman's sampling formula which corresponds to an EPPF 
of the form 

(5) Hit > (n 1 ,...,n k )= j^jT— | \[(l-(T) nj -i, 

where 9 > —a and a € (0, 1) or a < and 9 = v\a\ for some positive inte- 
ger v. See [26]. This can also be seen as the probability distribution induced 
by the species sampling model P(-) =Y^jLiPj^Xj where the A^-'s are in- 
dependent and identically distributed from some nonatomic distribution H 
and the weights pj are constructed via a stick-breaking procedure as in (1) 
the only difference being, now, that Wj ~Beta(l — a, 9 + ja) for any j > 1. 
We also agree that Wj ~Beta(l — a, 0) implies that Wj = 1 almost surely. 
The random probability P is termed the two parameter Poisson-Dirichlet 
process. See [27, 29]. 
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Another interesting example of EPPF arises from the normalization of a 
generalized gamma process, as defined in [4], and leads to 



(6) 



where j3 > and T(a;x) := s a ~ 1 e~ s ds is, for any x > 0, the incomplete 
gamma function. See [14, 28] and [19] for an application of the corresponding 
random discrete distribution in the context of mixture modeling. For general 
results concerning random probability measures derived via normalization 
procedures see [15, 16, 17, 28, 31]. 

The examples we have briefly illustrated so far share a common structure. 
Indeed, one may note that each EPPF in (3), (5) and (6) arises as a product 
of two factors: the first one depends only on (n, k) and the second one de- 
pends on the frequencies (ni,...,n k ) via the product JTj=i(l ~~ f)ry-l- This 
structure is the main ingredient for defining a general family of exchangeable 
random partitions, namely the Gibbs-type random partitions. 

Definition 1 ([9]). An exchangeable random partition IT of the set of 
natural numbers is said to be of Gibbs form if, for all 1 < k < n and for any 
(n±, . . . ,n k ) in A nk , the EPPF of II can be represented as 

k 

(7) n( n) (m, ...,n k ) = v n , k - o-) n .- U 

3=1 

for some a G [0, 1). 

It is worth noting that the previous definition holds also for negative 
values of a. See [9]. According to Definition 1, an exchangeable Gibbs-type 
random partition is completely specified once the V^'s have been assigned. 
As shown in [9], if a set of nonnegative weights "V := {V n k : k = 1, . . . , n;n > 
1} solves the forward recursive equations 

(8) V ntk = (n-ak)V n+ltk + V n+ljk+1 , 

then "V identifies the EPPF of a Gibbs-type random partition. Hence, for 
infinite exchangeable sequences of random partitions, the above recursion 
might provide a constructive approach in order to determine Gibbs-type 
random partitions. In [9], Theorem 12, one can find a complete description 
of the extreme points of "V . With reference to the previously illustrated ex- 
amples, the corresponding set of weights "V are immediately identified from 
(3), (5) and (6), respectively. Recently, [11] have investigated the dependence 
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of the distribution of the frequencies of the clusters of a Gibbs-type partition 
on their least elements and have extended some of the results contained in 
[10] relating to the Ewens sampling formula. 

Finally, note that Definition 1 directly involves infinite sequences II = 
{II n } of exchangeable random partitions. One can, however, confine oneself 
to considering just a finite sequence of partitions II = {n n }^ =1 for some 
integer N > 1. In this case, we say that II is a finite Gibbs random partition 
if it is characterized by an EPPF of the form (7), for any k G {l,...,n} 
and n G {1, ... , N}. Note that in this case, the addition rule (4) defining the 
EPPF holds true for n G {1, . . . , TV - 1}. 

3. Conditional structures of Gibbs-type random partitions. The main 
goal we are pursuing in the present paper consists in investigating some con- 
ditional structures that emerge when the observations are sampled according 
to a Gibbs-type random partition with a view to deriving Bayesian nonpara- 
metric estimators for quantities of interest. The issue we address consists in 
evaluating, conditionally on the partition of a basic sample of size n, the 
probability of sampling, if m draws, a certain number of observations yield- 
ing new partition groups with specified frequencies. Such a quantity can be 
useful in a variety of applications, some of which we highlight in Section 4. 
Resorting to the notation set forth in the Section 2, we study distributional 
properties of the partition of the set of integers {n + 1, . . . , n + m}, given [n] 
has been partitioned into j classes with respective frequencies (n±, . . . ,nj). 
A few quantities, analogous to those describing the partition structure of 
[n] , need to be introduced in advance. We let Km ^ = K m+n — K n stand 
for the number of new partition sets C±,...,C <„) generated by the addi- 

Hi 

tional m-sample X n+ i, . . . ,X n+m . Furthermore, if C := Uj=L C« whenever 
K$ > 1 and C = if id n) = 0, we set := card({X n+1 , . . . , X n+m } n C) 
as the number of observations belonging to the new clusters Cj . It is obvious 
that Lm G {0, 1, . . . ,m} and that m — Lm observations belong to the sets 
defining the partition of the original n observations. According to this, if 
S (n) = (S („),... ,5 („) ,(«)) then the distribution of S <„), conditional on 

L^m = 5, is supported by all vectors (si, . . . ? s ( n )) of positive integers such 

that Ei=™ 8i = s. The remaining m — L^m observations are allocated to the 
"old" K n groups with vector of nonnegative frequencies R n = (R±, . . . , Rr u ) 
such that 22i=a Ri = m — Lm' ■ Throughout we also assume that all random 
quantities are defined on a common probability space (Q, J^,P). 

Proposition 1. Suppose that IT = {IT n }^ =1 is a Gibbs-type exchange- 
able random partition with weights V n ^ and parameter a G [0,1). Then, the 



CONDITIONAL GIBBS STRUCTURES 7 

M (n) 

joint distribution of Km , L m and S („) , given K n and N n , is of the form 
F(Kg> = k, L$ = s, S r (n) = ( Sl , . . . , s K w )\K n = j, N n = (m ,...,n Kn )) 



(9) = F(K$ = k,L%> = s, S r(n) = (*i,..., s Kin) )\K n 



J) 



Vn+m,j+k ( m X 



V n , 



(n-ja) m - s Yii 1 - °)si-\- 

8=1 

Hence, the number K n of partition sets in the basic n sample is sufficient for 
predicting: (i) the number of sets into which {n + 1, . . . , n + m} is partitioned, 
(ii) the number of points from the subsequent m sample that belong to the 
new sets of the partition of [n + m] and (iii) the frequencies in each of these 
new groups. 

By marginalizing the conditional distribution in (9) with respect to S („) 

(n) 

and, then, with respect to K m one obtains the conditional distribution for 
the number of new groups and the number of observations belonging to these 

(n) 

new groups and the distribution of L m , respectively. These marginalizations 
yield results in terms of generalized Stirling numbers or generalized factorial 
coefficients, denoted as "^(s^jCr) and whose representation is given in (37). 

(n) (n) 

Corollary 1. The joint distribution of K m and L m , given K n , can 
be expressed as 

F(K$ = k,L$ = s\K n = j) 

(10) 

V n +m,j+kfm\ ( . . ^(s,k,a) 
J (n-J0-) r 



(n) 

for k < s = 0, . . . ,m and the conditional distribution of L m is of the form 
(11) P(LW = s\K n =j)=M(n- ja) m . s ± V ^J^M 
for s = 0, . . . ,m. 



From (10) and (11) one can also deduce other explicit forms for conditional 
distributions of interest. For example, the distribution of the number of 
observations in the new m-sample which lie in new partition sets, given the 
number of groups present in the basic n-sample and the number of new 
clusters , is of the form 

(12) P(LW = s\KV> = k,K n =j) = 0(n-ja) m . s ns,ka) 

^{m,k;a,—n + ja) 
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for s = k, . . . ,m, where "^(n, k; a, 7) is a noncentral generalized factorial co- 
efficient representable as in (39). It is worth noting that the previous expres- 
sion does not depend on the particular Gibbs prior it is derived from: in- 
terestingly, Gibbs-type random partitions share the same conditional struc- 
tures once and K n are fixed. This finding is reminiscent of a result 
in [9] where the authors show that K n is sufficient for the Gibbs random 
partition of the first n integers meaning that the conditional distribution of 
the partition of [n] given K n does not depend on the weights V n ^- On the 

(n) (n) 

other hand, the conditional distribution of K m , given L m and K n , is of 
the form 

(13) P(*W = k \ L (n) = SiKn=]) = Vny J+ &(sM/o* 

for any k G {0, . . . , s}. Moreover, the Bayes estimator (under quadratic loss 
function) for the expected number of new clusters, proposed in [20] , is easily 
recovered from (10) as 

(14) E(K^\K n = j) = ± k Vn +r+k V(rn,k-a,-n + J a) ^ 

fc=o V ^ a 

Often interest relies also in determining an estimator for the number of 
observations in the subsequent m-sample that will belong to new species. 
For instance, in genomic applications this can be seen as a better measure 
of redundancy of a certain library. For this purpose, one can resort to (11) 
and the corresponding Bayes estimator is given by 

(15) HLt\K n =j) = ±s(™){n- ja) m . s ± ^M±t^M . 

Then, ~E{L^\K n = j)/m is the expected proportion of genes in the new 
sample which do not coincide with previously observed ones. The expres- 
sion in (15) admits a noteworthy simplification as outlined in the following 
proposition: indeed, the Bayes estimator is m times the probability that 
the (n + l)th draw yields a new cluster, given that j distinct clusters are 
generated by the first n observations. 

Proposition 2. For any j E {1, . . . , n} and m>l one has 

(16) E(zW|tf B = i)=ro%2±l. 



All the previous expressions are easily available for the three examples we 
have mentioned in Section 2. We first focus our attention on the Dirichlet 
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process which represents the most well-known case. Indeed, from (3) one 
finds out that V nt k = 9 k /(9) n and, for instance, (9) reduces to 

P(L W = S ,K^ = k, S („) = (ai , • • • , 8 (n) ) = i) 

(17) 

, s A; 

(n) m - s JJ(sj - 1)!. 
i=i 



(6> + n) m V « 

Note that simple algebra leads to rewrite the above expression as 



sy\ V + nJ \{V + n) s t^ ' | (0 + n + s) 



i=l 

p e (n,m,k,s,s k ) 



where it can be immediately seen that the term in curly brackets on the 
left-hand side is the sampling formula in (9) with + n in the place of 
being the total mass parameter of the Dirichlet process conditioned on a 
sample of size re. Hence, the quantity pg(n,m,k,s,Sk) can be interpreted as 
the probability of drawing, conditional on the n past observations, a specific 
sample of size m of which s belong to the new k groups of the partition 
with vector of frequencies = (si,...,Sfc) and the other m — s coincide 
with any of the conditioning n observations. On the other hand, recall that 
linio-^o c ^"'k'°") = |s(n, k)\ where s(-, •) stands for the Stirling number of the 
first kind. This allows to determine the expressions appearing in (10) and 
(11). Indeed, one has 

P(JTW = =s\K n =j) = M 6 ^^Hn 1 k)\ 

\ s J (0 + n) m 

and, using the definition of the signless Stirling number of the first kind 
according to which 

(18) ^0 i | 5 ( S ,i)| = (0) s 

i=0 

(see, e.g., [6], page 2536), one has 

m\ (n) m - s (9) s ( m 



s 



where it is apparent that gg(n,m, s) is the probability, conditional on a 
sample of size re, of observing a specific m-sample containing s elements not 
contained in the conditioning ra-sample. 
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Remark 1. It is important to note that the conditional structure of the 
Dirichlet process does not depend on K n : it only depends on the size of the 
basic sample n. This is, indeed, a characterizing property of the Dirichlet 
process as shown in [36]. Such a property simplifies the mathematical ex- 
pressions but represents a serious drawback for applications. Indeed, it is 
reasonable to expect that K n influences prediction of the clustering struc- 

(n) 

ture of future observations: the larger K n the more new clusters K m and 

(n) 

the more observations belonging to these new clusters L m one would ex- 
pect. This is the reason which explains the interest in a more general family 
of partition distributions such as those of Gibbs-type for which prediction 
depends on K n . Finally, it is worth recalling that the Dirichlet process can 
be seen as a two parameter Poisson-Dirichlet process with parameter (9,0). 
Hence, when we deal with the Poisson-Dirichlet process in the sequel, the 
Dirichlet process case can be recovered by letting a — > 0. 

Remark 2. All the quantities described up to now, and developed in 
the next subsections, depend on the analysis of the conditional structure of 
a Gibbs-type random partition. Investigation of the conditional structure 
for the sequence of blocks (K n ) n >i is pursued in [9] where the authors do 
consider the conditional distribution of the number of groups in the par- 
tition of [n] , given the number of blocks in which [n + m] is partitioned. 
In our setting, where prediction is the main focus, we are more interested 
in evaluating conditional probabilities (or expectations) for the partition of 
future observations given the partition structure of past observations. And 
we also consider other relevant quantities, besides the number of groups. It 
might be that starting from the conditional characterizations provided by 
[9] one can derive formulae analogous to those we are now going to establish, 
but we find our approach more direct and particularly suited to the specific 
prediction problems we have in mind. 

3.1. The process generating new clusters. We are now going to consider 
an important quantity which describes the partition structure of observa- 
tions generating new groups in a further sampling procedure, conditional on 
the partition generated by the first n observations. In particular we are able 
to point out a sort of reproducibility of the Gibbs structure as established 
by the following proposition. 

Proposition 3. Let II = {II n }^ =1 be a Gibbs-type random exchange- 
able partition whose EPPF is characterized by the set of weights {V n) k '■ k = 
1, . . . , n; n > 1} and by the parameter a £ (0, 1) . Then 

F(4 n) = fc,S /W = (si, . • .,s kM )\L$ = s,K n =j,N n = (ni,.. ..n,)) 
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_ Vn+m,j+k TT/i _ \ 

for any s G {1, . . . ,m}, k G {1, . . . ,s}, j G {1, . . . ,n}, (m, ...,rij) G A nj - and 
(si, . . . , Sfc) G A s fe. Consequently the partition of the observations which be- 
long to the new partition sets is, conditional on the basic sample of size n, a 
finite Gibbs-type random partition with weights {V s ^(Tn, n,j):s = l,...,m;k = 
1, . . . , s} defined by 

(20) M^-i) = ^^M 

and with parameter a G [0, 1). 

Note from (19), again, that 

F(4"' = k, S w = ( Sl , . . . , s (n) ) = s, K n = j, N n = (ni , . . . , ry)) 

= F(K$ = k, S w = ( 5l , . . . , a W )\L$ = s, K n =j). 

The finiteness of the random partition described by (19) is obvious, since 
it takes values on the space of all partitions of [s], with 1 < s < m. More- 
over, the particular structure featured by the conditional distribution in (19) 
motivates the following definition. 

Definition 2. The conditional probability distribution 
Ur\s 1 ,...,sk;m,n,j) 

(21) 

:= P(4"' = k, S lW =(sx,..., a (n) )\L%> =s,K n = j), 
with 1 < s < m and 1 < k < s, is termed conditional EPPF. 

Hence, the probability distribution in (19) is a conditional EPPF giv- 
ing rise to a finite Gibbs-type random partition. Even if the structure of 
nj^ (si , . . . , Sjt; m, n, j) is quite general, one might wonder whether it is pos- 
sible to provide more information about its V s k(m,n,j) weights in some 
particular cases. For example, it would be interesting to ascertain when 

V^fc(m, n,j) does not depend on m and n, so that fl^\s\, . . . , Sk] m, n, j) = 

fi k s \s\, . . . , Sk] j), which means that the conditional EPPF is that corre- 
sponding to an infinite Gibbs partition. This leads us to state the following: 

Corollary 2. The conditional EPPF ft^ (si, . . . , Sk;m,n,j) does not 
depend directly on m and n if and only if it is determined from a two- 
parameter Poisson-Dirichlet random partition. 



12 



A. LIJOI, I. PRUNSTER AND S. G. WALKER 



Having the conditional EPPF IT^ at hand, one can compute some other 
interesting conditional distributions in a straightforward way. For example, 

~ (s) 

if one combines the expression for LT^. with Corollary 1 it is immediate to 
check that 



is an expression for the conditional distribution of detecting a particular 
configuration (si, . . . , s^) for the observations belonging to the new partition 
sets, given the number of new sets, the number of observations falling into 
these sets and the basic n-sample. 

All the sampling formulae we have deduced so far have important ap- 
plications in Bayesian nonparametrics and population genetics. In Bayesian 
nonparametrics, random discrete distributions are commonly employed in 
order to define a clustering structure either at the level of the observations 
or at the level of the latent variables in a complex hierarchical model. In 
particular any EPPF corresponds to some random discrete distribution and 
it represents, together with all the expressions for the conditional distribu- 
tions we have obtained, a useful tool for specifying prior opinions on the 
clustering of the data. In population genetics, the concept of conditional 
EPPF can be seen as follows. Given a sample of size n containing j distinct 
species with absolute frequencies ni, . . . ,rij, a new sample of size m is to be 
drawn. Given that s of the m observations contribute to generating newly 
observed species, that is, they belong to new distinct clusters, one might be 
interested in evaluating the probability that the s observations are grouped 
into k clusters with respective frequencies si,...,Sfc. The answer to such a 
question is provided by a conditional EPPF. The other distributions, dis- 
cussed previously, provide a wide range of sampling formulae which answer 
similar types of problems. In the following subsection we focus attention on 
some noteworthy particular cases, namely the Poisson-Dirichlet distribu- 
tion, the two-parameter Poisson-Dirichlet distribution and the generalized 
gamma partition distribution. 

3.2. Illustrative examples. We start our illustrations by considering the 
two-parameter Poisson-Dirichlet process due to [26] . The EPPF of this pro- 
cess is also known as Pitman sampling formula. Basing upon Proposition 1, 
one has 



P(S, w =( Sl ,...,s K )\K£> = k, 4"' = s,K n = j) 

vn\' L > 




s (n) )\K n =j 



) 
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and it is possible to derive explicit expressions for all the sampling formulae 
set forth in Section 2. First note that from properties of generalized factorial 
coefficients, one has 

(9 + n) m r^Vo" Jk 





9 + ia 



k=0 

■ 



(0 + n) 

= V n+m j(9 +jcr) s . 
According to this equality, from (11) one has 

(23) P[LW = S |iT n =j\ = 1 (™) (n - j<r) m _ s (0 + j<r) s . 

Now, (23) yields an estimate for the expected number of observations which 
do not coincide with the previously observed ones which, by virtue of Propo- 
sition 2, coincides with 

(24) m&Kn=j] = '^±M. 

Consider now the conditional EPPF in (19), which is associated to the pro- 
cess generating the new clusters as explained in Section 3.1. We know by 
Corollary 2 that in the two-parameter Poisson-Dirichlet case the V s k(m, n,j) 
weights do not depend on m and n. Their specific form is easily seen to be 

Ir / .\ Vn+m,j-{-k Vn+m,j+k 

V s ,k{rn,n,j) - 



T,i=o v n+m,j+iO- ic tf{s,i,a) V n+m j(9 + ja) s 

_ lift- + ia) _ n?=o(0 + 3°_ + io) 
(9 + ja) s (9 + ja) s 

with the proviso that Oi=s + jo~ + i) = 1. Hence, the conditional Pitman 
sampling formula is given by 

(25) fiW( fll , . . . ,s k ; m,nj) = ^i=^ + ja + la) ^ 

Now set 9' = 9 + ja and note that the conditional EPPF of a Poisson- 
Dirichlet process with parameter (9, a) is again a Poisson-Dirichlet process 
with an updated parameter (9', a). This can be seen as a quasi-conjugacy 
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of the two-parameter Poisson-Dirichlet process, where by quasi-conjugacy 
we mean that the process generating the new observations is of the same 
form as the prior process with updated parameters. Hence, at this stage one 
can equivalently re-express Corollary 2 above in a language quite familiar in 
Bayesian nonparametrics as follows: 

Corollary 2' . The only quasi- conjugate Gibbs-type prior is the two- 
parameter Poisson-Dirichlet process. 

Note that the quasi-conjugacy of the two-parameter Poisson-Dirichlet 
process was first shown in Pitman ([28], Corollary 20) by means of different 
techniques, whereas the characterization as the only quasi-conjugate Gibbs 
prior is new. With the Poisson-Dirichlet process with parameter (#,0), that 
is, the Dirichlet process prior with parameter measure having total mass 
6 > 0, some useful simplifications occur. For example, the conditional EPPF 
is 

(2 6 ) n« (51 , .... mi) = ffiff-'f = ^ftc - »' 

which replicates the unconditional form of the EPPF in (3) and, as expected, 
does not depend on (m,n,j). From a Bayesian nonparametric perspective, 
this is not surprising given the conjugacy of the Dirichlet process (see [8]). 
Indeed, this is just a reformulation, in a different context, of the fact that 
given a sample from the Dirichlet process, its conditional distribution is 
again a Dirichlet process. Moreover, 

P(S lL „ } =(*!,..., s Km(n) )\K$ = k, 4 n) = a, K n = j) = n -= 1 ( ( ^ fc ~ 1)! . 

A further example of exchangeable Gibbs-type random partition for which 
closed-form expressions of sampling formulae are available is the generalized 
gamma distribution (6). The conditional EPPF of the corresponding random 
partition is given by 

IL^\si,...,s k ;m,n,j) 

^Er+r 1 (w+rl)( _ 1)t ^ /CTr(j + k _ i/a . p) 

E!=o n*, h a) Er= + o m " 1 ( n+ 7 _1 ) (-i)^ /ff r(j + i - 1 1 a- (3) 



(27) 



k 

X 
i=l 



and all sampling distributions described in Section 2 can be derived in a 
straightforward way. 
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3.3. Looking backward. In this section we face the problem of determin- 
ing the probability that certain specific observations, present in the basic 
sample, are not re-observed in the additional m-sample. This is tantamount 
to deriving the probability that the new observations belong either to new 
clusters or to specified "old" clusters. 

Let A\ , . . . , Aj be the classes of K n = j sets into which the first n observa- 
tions, or n integers {1, . . . , n}, are clustered. Define M r (nJ) := M r (n,i) (h,...,i r ) 
to be, for any (ii, . . . ,i r ) G {1, . . . ,j} r such that / ii for any I ^ k, the 
event which is true if and only if none of the m observations belongs to 
any of the sets A{ where i ^ {i±, . . . ,i r }. That is, Mr"' 3 ' is true if the m 
new observations belong either to "new" clusters or to the specified "old" 
clusters A^ , . . . , Ai r . We are now interested in evaluating the probability 
of such an event. Obviously, one has r G {l,...,j} and recall that of the 
m new observations m — s are the ones belonging to the "old" clusters. 
Correspondingly, we set A r = (A^, . . . , Aj r ) to be the vector of frequencies, 
that is, Aj ; = card({n + 1, . . . , n + m} n A^ ) > for any I = 1, . . . , r and 
Y7i=i A*, =m — s. Hence, it can be seen that 

F(K n = j, N n = nj ,L$ = s, K$> = k, 8 lm) = s R{m) , A r = A r ) 

(28) 

k r j 

r=l 1=1 l=r+l 

From (28) a number of interesting distributions can be derived. They typ- 
ically provide information about the possibility of not re-observing certain 
"old" species in a subsequent "new" sample. The main result of the subsec- 
tion we wish to state is the following: 

Proposition 4. Given that the basic n-sample is partitioned into K n = 
j classes, A\, . . . ,Aj, with frequencies (m, . . . , Uj), the probability that the 
observations from the subsequent m-sample contain either elements from 
Ai 1 ,...,Ai r , with r G {1, . . . or from new clusters is given by 

F(M^\K n = j,N n = n j ) 

(29) 

_ Vn+m,j+k ffjm, k; a, ra - J2i=i n^) 

k=0 Vn >3 a 

For the two-parameter Poisson-Dirichlet process, one has 

Vn+mj+k _ (g + l)n-lllj=l" 1 (g + ^) 



V nJ a k a k (9 + l) n+m _i Ui=l(0 + ia) 

_ U-^(e + ja + ia) _ ((9 + ja)/a) k 
o- k {6 + n) m (0 + n) m 
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Hence, combining (29) with the definition of noncentral generalized factorial 
coefficient (38) in the Appendix, one has 

F(M^\K n = j,N n = n j ) 

(30) = <^£(^)/(^"-£»*) 

= (A + (j + )m 

(0 + n) m 

Such a simple expression provides the conditional probability that no integer 
in {n + 1, . . . , n + m} will belong to any of the sets Ai, with i ^ . . . , i r }, 
generated by [n]. In other terms, of the j clusters associated to the (con- 
ditioning) partition of [n], at most the r clusters with indexes i%,...,i r do 
possibly contain integers from {n + 1, . . . , n + m}. 

3.4. The case a < 0. In the previous subsections we have focused on 
Gibbs random partitions with a G [0, 1). See Definition 1. The nonnegativ- 
ity of a ensures that (7) defines the probability distribution of an infinite 
exchangeable partition. On the other hand, when a < Lemma 8 in [9] 
entails that the weights V = {V Ut k : k = 1, . . . , n; n > 1} in (7) are mixtures 
of the weights V{y) = {V n ^{y) ■ k = 1, . . . , n; n > 1}, for u = 1, 2, . . . , and 
^n,fc(^) — \(r\ k is(is — 1) • • • {y — k + 1) / (u\a\) n . Thus, the V^^'s correspond, 
for each v = 1, 2, . . . , to the weights of Pitman's sampling formula (25) and 
imply that the number of distinct species in the population is v. See [9] 
for details. According to Theorem 12(i) in [9], V arises as a mixture of the 
weights V(l), ... , V(N*), where N* G {1, 2, . . .} U {oo} and one can, then, 
obtain the same results as stated in the present section, with the proviso 
that K n < N* + 1. If N* = oo, then no relevant change occurs. In partic- 
ular, a slight modification of the proof allows one to recover the charac- 
terization of Corollary 2, with the conditional EPPF n[ s ^(si, . . . ,Sk;m,n,j) 
being defined for any j < N* + 1. Note that, in this case, the two-parameter 
Poisson-Dirichlet model coincides with symmetric Dirichlet distributions. 
See [29]. 

4. Application to the analysis of EST data. In this section we show how 
Gibbs priors can be applied in a straightforward way to the analysis of Ex- 
pressed Sequence Tags (ESTs). ESTs are generated by partially sequencing 
randomly isolated gene transcripts that have been converted into cDNA. 
From their introduction in [1], ESTs have played an important role in the 
identification, discovery and characterization of organisms as they provide an 
attractive and efficient alternative to full genome sequencing. The resulting 
transcript sequences and their corresponding abundances are the main focus 



CONDITIONAL GIBBS STRUCTURES 



17 



of interest providing the identification and level of expression of genes. Given 
a cDNA library and an initial sample of reads of size n, the main statistical 
issues to be faced are of predictive nature in the sense that various features 
of a possible additional sample of size m are to be predicted. See, for exam- 
ple, [23, 24, 32]. Such features include, for instance: (i) the expected number 
of new genes meant as an estimate of the number of new unique genes to 
be detected in the additional EST survey; (ii) the expected number of genes 
which do not coincide with genes already present in the initial sample; (iii) 
the probability that certain specific genes, present in the basic sample, do 
not appear in the additional sample. Based on these estimates important de- 
cisions are to be taken. For instance, researchers have to decide: (i) whether 
to proceed with sequencing from a certain library; (ii) whether to carry out 
a "normalization" protocol (an expensive procedure which aims at making 
the frequencies of genes in the library more uniform); (iii) which libraries, 
among several ones concerning the same organism, are less redundant in the 
sense that they deliver more information from an additional sample. 

The Bayesian nonparametric framework based on Gibbs-type random 
probability measures represents a natural, and at the same time power- 
ful, approach for dealing with these kinds of problems since it conveys, in a 
statistically rigorous way, the information present in the initial sample into 
prediction. In particular, we focus on the two-parameter Poisson-Dirichlet 
process, which stands out for its mathematical tractability. 

In order to illustrate the results of the previous section we first deal with 
a simple numerical example and then analyze some real EST data. The in- 
formation provided by an EST data set sequenced from a cDNA library 
is summarized by the size of the sample n, the number of different cDNA 
fragments j , each of which represents a unique gene and their corresponding 
expression levels. Recalling the notation set in the Introduction, M^ n stands 
for the number of clusters of size i with the initial ra-sample: within the EST 
framework M^ n is now the number of genes with expression level i. For our 
purposes it is useful to convert the Mj jn 's into the N^n's, the frequencies (or 
expression levels) of the various unique genes: hence, the sample information 
is given by n, j and (m, . . . , rij). We then assume the EST data are an ex- 
changeable sequence with nonparametric prior given by the two-parameter 
Poisson-Dirichlet process. This implies that the clustering of the ESTs fol- 
lows a two-parameter Poisson-Dirichlet random partition (5). Such a setup 
postulates the sequence of tags to be extendible to infinity: however, interest 
relies in computing estimates for m up to the size of the library, which is 
always finite implying finiteness of all the estimates. In order to specify the 
prior parameters 6 and a we resort to an empirical Bayes approach as in 
[20]. Hence, we fix a and 9 so to maximize (5) corresponding to the observed 
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sample (j, m, . . . , nj), that is, 

(31) (*, 9) = axgmax n ^ + m) n(l - a) ni ^. 

{0,6) [V + ljn-l i=1 

Given this, the model is completely specified and attention can be focused 
on predicting various features of a future sample of size m. 

4.1. Numerical example. Here we compare predictions arising from two 
different basic samples both of size n = 100. The sample sequenced from 
library 1 is composed of j = 59 unique genes with mi ioo — 40, m 2i ioo = 
10, m.3,100 = 4 > m 4 ,ioo = 2, m 5) i = 2, mi ,ioo = 1, whereas the sample se- 
quenced from library 2 consists of j = 37 unique genes such that mi ioo = 
20, m 2j ioo = 5, m 3) ioo = 4, m4,ioo = 3, m 5i i o = 2, m 6i i o = 1, mi 0) ioo = 1, 
m2o,ioo = 1. It is to be noted that the first one features a higher number of 
unique genes and the expression levels of the genes is remarkably lower. The 
average expression level, n/j, is 1.69 for the sample taken from library 1 and 
2.7 for the sample sequenced from library 2. The parameters for Pitman's 
sampling formula are set according to (31), which yields 6\) = (0.34, 33) 
and (02, #2) = (0.26, 12) for the two cases. Furthermore, we consider an ad- 
ditional sample of size m = 100. 

The expected number of new genes in the additional m-sample can be 
immediately derived from (14) and is given by 

(32) E[K$ \K n = j} = J2 k ^^ Vin, k; a, -n + ja). 

{0 + n) m 

In our case the estimator leads to predict 33 and 15 new unique genes, re- 
spectively. This is in accordance with the intuition, which leads to guess a 
higher number of new genes for library 1, since the basic sample featured 59 
unique ones in contrast with 37 of library 2. A second quantity of interest 
is the expected number of genes, in the additional sample, which do not 
coincide with previously observed ones. Such an expression is given in (24) 
and it can be seen as a better measure of redundancy of the library since, 
in contrast to (32), it takes also the expression levels of the new genes into 
account. In our case E[Lm^\K n = j] yields 40 for library 1 and 19 for library 
2. At first glance these estimates may seem low since one would expect the 

difference E[Lm^|-Kn = j] — ^[Km \K n = j] to be larger. However, it is rea- 
sonable that only a few new unique genes will have expression levels greater 
than 1: otherwise they would have been discovered already in the basic sam- 
ple. Combining the two estimates one can obtain a plug-in estimator of the 
average expression level of the new unique genes in the additional sample as 
A m :=K[Lm^\K n = j]/K[Km \K n = j], which in our case are equal to 1.21 
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for library 1 and 1.28 for library 2. If one is interested in the overall average 
expression level after n + m = 200 reads, the estimator 

(33) A n+m := (n + m)/(j + E[K^ \K n = j}) 

yields 2.17 for library 1 and 3.85 for library 2. 

Another important aspect to look at is represented by the frequency con- 
figurations of the new unique genes in the additional sample. In particular, 
one is interested in establishing which type of configurations are more likely 
to appear. By the above considerations, it is clear that these will have a 
few numbers of unique genes with significant expression level (which have 
"escaped" being sequenced in the basic sample) and all the others with ex- 
pression level 1. For detecting such a feature, we work conditionally on Km 

(n) 

and on L m which leads to the following probability distribution for S („) 
F(S S = ( Sl ,..., s k )\K$ = k, L$ = s, K n = j) 

(34) 

It is then reasonable to set K^' equal to the expected number of new unique 
genes arising from (32) and Lm equal to the expected number of genes which 
coincide with any of the newly observed genes, given in (24) with a = a and 
9 = 9 as in (31). Denote these values by k m and s m , respectively. Given this 
we consider the ratio of the distribution in (34) for two configurations 

2 

and Sr as an index for establishing which configuration is more likely to 

Km 

appear. From (34) one immediately obtains 

(35) 4"'(sL,sL: nfc(1 -<'- 



where obviously Y^i=i s i = s m ■ Let us first consider library 1 and compare 

S4Q given by 32 genes with expression level 1 and 1 gene with expression 

2 

level 8 with S 40 such that 26 genes are observed once and 7 twice. Then, 

^ioo°^ (S40) ^4o) = 34346, that is, the unbalanced configuration with only one 
gene having expression level 8 is 34346 times more likely than most balanced 
configuration. If we compare the first configuration with S 40 given by 31 
genes with expression level 1 and two genes with expression levels 4 and 5, 
respectively, then it appears that configuration 1 is "only" 60 times more 
likely. For library 2, things are quite different, even though the unbalanced 
configuration still predominates. By comparing S 19 given by 14 genes with 

2 

expression level 1 and 1 gene with expression level 5 with S 19 , where 11 
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genes are observed once and 4 genes twice, one has -Zjoo°^(Si9> S^ 9 ) = 44. This 
means that the odds in favor of the unbalanced configuration with respect 
to the most balanced are "only" 44. By taking an intermediate configuration 
such as 13 genes observed once and two observed 2 and 4 times, respectively, 
the odds reduce to 5. 

Finally, it is also worth looking backward, in the sense of determining 
probabilities that certain genes present in the initial sample will not be re- 
observed in the additional survey of size m. Since one is typically interested 
in probabilities concerning the most highly expressed genes, or the genes 
with expression level 1, it is useful to order the frequencies in the initial 
sample in increasing order and denote them by ?i(i),H(2)j • ■ • i n (j)- Then, 
from (30), the probability of not re-observing the j — r most highly expressed 
genes is given by 

^ (0 + ti -r)a + E r i=in { i))m 



(0 + n) m 

In order to avoid that probabilities take on too-low values, set m = 10. As for 
library 1, the probability of not observing the unique gene with expression 
level 10 is 0.482, whereas the probability of not observing the 40 genes with 
expression level 1 is 0.118. It is also worth noting that the probability of 
not observing certain specific 10 genes with expression level 1 (out of the 40 
present in the initial sample) is given by 0.611. From this, one can see that it 
is more likely to re-observe a gene with expression level a than a genes with 
expression level 1: this appears to be a reasonable and, indeed, desirable 
feature for a model dealing with species prediction problems. As for library 
2, one can, for instance, compute that the probability of not re-observing 
the unique gene with expression level 20 is 0.156, while the probability of 
not re-sequencing the 20 genes with expression level 1 is 0.257. Again, the 
probability attached to highly expressed genes is more than proportional 
with respect to genes with expression level 1. Finally, note that these proba- 
bilities are not directly comparable between libraries: this is due to the fact 
that library 1 exhibits a higher estimate of new genes to be discovered in the 
additional sample and also a higher number of observations which belong to 
these new clusters: consequently, it is natural that the probabilities of not 
re-observing certain genes are always higher for library 1. 



4.2. Genomic example. Here we analyze a tomato-flower cDNA library 
from the Institute for Genomic Research Tomato Gene Index with library 
identifier T1526 [30]. This library was made from 0-3 mm buds of tomato 
flowers and was previously analyzed in [20, 23, 24] with reference to the 
determination of the discovery probability of further reads from the library. 
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Table 1 

Description of the 10 sub-samples of size n = 1000 and the true values of the 
quantities to be estimated on the remaining m = 1586 data: the second column 

indicates the number ATiooo = j of distinct genes in the sub-sample; columns 
3-6 report the true values of (k[1° s I°^ \K 100 o) , (-^1586° l-Kiooo), A 15 g 6 and A 2 s86 



N 


3 




^true 


A true 
^■1586 


A true 
^2586 


1 


825 


1000 


1166 


1.166 


1.417 


2 


816 


1009 


1142 


1.132 


1.417 


3 


806 


1019 


1151 


1.130 


1.417 


4 


834 


991 


1146 


1.156 


1.417 


5 


820 


1005 


1150 


1.144 


1.417 


6 


831 


994 


1145 


1.152 


1.417 


7 


819 


1006 


1150 


1.149 


1.417 


8 


813 


1012 


1130 


1.117 


1.417 


9 


812 


1013 


1135 


1.120 


1.417 


10 


830 


995 


1157 


1.163 


1.417 



The initial sample consists of n = 2586 ESTs with j = 1825 unique genes. 
The tomato flower data set shows the following expression levels: 

™-i,2586 = 1434, 253, 71, 33, 11, 6, 2, 3, 1, 2, 2, 1, 1, 1, 2, 1, 1 

with i € {1, 2, . . . , 14} U {16, 23, 27}, which means that we are observing 1434 
genes which appear once, 253 genes which appear twice, etc. The average 
expression level of the basic sample is 1.417. 

We first perform a cross-validation study for assessing the performance 
of the method. To this end 10 sub-samples of size 1000 have been drawn 
without replacement from the available 2586 EST sample. On the basis of 
each sub-sample, the corresponding values of (a, 9) have been fixed according 
to (31). Then, we have computed the estimators for an additional sample 
of size m = 1586, which corresponds to the remaining observed data. In 
addition to the Bayes estimates ~K{Km\K n = j) and ~E(Lm^\K n = j), we 
also computed, using the distributions of (Km^\K n = j) recoverable from 
(10) and of (I$\K n = j) given in (23), the 95% highest posterior density 
(HPD) intervals; these represent the Bayesian counterpart to frequentist 
confidence intervals. Finally, also the estimates for the average expression 
levels have been computed. Table 1 reports the true values corresponding to 
each sub-sample, whereas Table 2 displays the estimates with corresponding 
95% HPD intervals. 

By comparing Table 1 and 2, one sees that 9 times out of 10 the highest 
posterior density interval covers the true number of distinct genes present in 
the additional sample, whereas the true number of genes not coinciding with 
previously observed ones is always covered. The average prediction errors are 
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24.5 and 21.2 genes, respectively. The average error in the estimation of the 
expression level of the additional sample is 0.0026, whereas the average error 
of the estimates of the overall expression level is 0.019. Given the fact that 
prediction is carried out over an additional sample of size about 1.5 times 
the used sub-sample, such results appear completely satisfactory. 

We now deal with the problem of predicting the outcomes of future se- 
quencing and, as possible sizes of the additional sample, we consider m € 
{250, 500, 750, 1000}. As for the prior specification of (a, 6) the maximization 
in (31) leads to (<r, 6) = (0.612, 741). The corresponding estimates for the ex- 
pected number of new genes (32) and for the number of genes which do not 
coincide with previously observed ones (24) are reported in Table 2 together 
with the corresponding 95% HPD intervals. The estimates of the average 
expression level of the new unique genes and of the average expression level 
for the whole sample of size n + m are also reported. 

It is worth noting that for this real data set the two estimates in the 
first two columns are extremely close, leading to an extremely low average 
expression level for the new unique genes. This can be explained by two 
facts: (i) the number of genes with expression level 1 is already very high in 
the basic sample {mi^bm = 1434); (ii) the basic sample is large (n = 2586) 
and, hence, it is very unlikely that several highly expressed genes have not 
been sequenced. In such a case the frequency configuration of the additional 
sample is forced to be unbalanced and there is no need to compute (35) to 
state this. Just note that the most balanced configuration for m = 250 would 
be 136 genes with expression level 1 and 2 genes with level 2. 

Table 2 

Predictions, based on the sub-samples, of the quantities of interest on the remaining 
m — 1586 data: columns 2-3 display the parameter specifications derived from (31); 
columns 4-7 report the Bayes estimates K := ElK 1 ^^ |_Kiooo =j] and 
L := S[L^ggg l-Kiooo = j] with the corresponding 95% highest posterior density (HPD) 
intervals; columns 8-9 display the estimates for the average expression level in the 
additional sample and in the whole sample denoted by Aisse and A25S6, respectively 



N 




e 


K 


HPD 95% 


L 


HPD 95% 


^4.1586 


^4.2586 


1 


0.72 


444 


1017 


(969, 1067) 


1140 


(1089, 1190) 


1.121 


1.404 


2 


0.75 


344 


1012 


(963, 1065) 


1128 


(1076, 1180) 


1.115 


1.415 


3 


0.78 


254 


1009 


(959, 1063) 


1116 


(1063, 1170) 


1.106 


1.425 


4 


0.75 


410 


1049 


(1001, 1101) 


1165 


(1115, 1215) 


1.111 


1.373 


5 


0.65 


583 


979 


(932, 1028) 


1118 


(1068, 1168) 


1.142 


1.437 


6 


0.73 


446 


1034 


(986, 1084) 


1155 


(1104, 1204) 


1.117 


1.387 


7 


0.72 


420 


1004 


(955, 1055) 


1128 


(1077, 1179) 


1.124 


1.419 


8 


0.72 


397 


992 


(943, 1043) 


1115 


(1063, 1167) 


1.124 


1.433 


9 


0.69 


457 


976 


(928, 1028) 


1108 


(1056, 1159) 


1.135 


1.446 


10 


0.72 


466 


1027 


(980, 1078) 


1151 


(1100, 1200) 


1.121 


1.393 
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Table 3 

Estimates arising from the two-parameter PD model with [a, ff) = (0.612,741) for 
sizes of the additional sample corresponding to m 6 {250,500,750, 1000}. K and L 
denote the Bayes estimates E^™ 086 ' 1-K2586 = 1825] and E[L^ 586 ' |A"2586 = 1825], 
respectively, and are reported together with their 95% highest posterior density 
(HPD) intervals. Columns 6 and 7 display the estimated average expression levels 



m 


k 


95% HPD 


L 


95% HPD 


A 

rn 


j4.2586+m 


250 


138 


(122, 156) 


140 


(124, 155) 


1.014 


1.445 


500 


272 


(249, 297) 


279 


(256, 302) 


1.026 


1.471 


750 


402 


(373, 433) 


419 


(390, 448) 


1.042 


1.498 


1000 


530 


(496, 566) 


558 


(523, 593) 


1.053 


1.522 



Another issue of interest is the determination of the probability of not 
re-observing certain particular genes in the additional m-sample. This can 
be achieved via the expressions in (30) and (36). With reference to the 
EST data set we are analyzing, the probability of not re-observing genes 
with expression level larger than 10, which correspond to 9 genes with 
frequencies 11,11,12,13,14,16,16,23,27, is given by 0.656,0.123,0.016 for 
m = 10,50,100, respectively. The probability of not observing the 71 genes 
with expression level 3 is given by 0.593,0.075,0.006 for m = 10,50,100, 
respectively. 

APPENDIX 

A.l. Generalized factorial coefficients. The results in the previous sec- 
tions rely on the generalized factorial coefficients: here we provide a short 
account of their definitions and of formulae for their evaluation. For fur- 
ther details and pointers to the literature, the reader can refer to [5, 6]. See 
also [9]. For any n > 1 and k = 0, . . .,n, the generalized factorial coefficient 
^(n, k; a) coincides with the coefficient of the kth. order factorial of t in the 
expansion of the nth order generalized factorial of t with scale parameter 
a G R, that is, 

n 

{at) n = Y J ^in 1 k-a){t) k . 

k=0 

In order to determine the distribution of the number of different species 
appearing in a sample of size n, that is, K n , we have resorted to the following 
representation: 
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with the proviso that ^(0,0; a) = 1 and ^(n^cr) = for all n > 1. It is 
to be noted that ^ slightly differs from the definition of generalized fac- 
torial coefficient C(n,k;a) as given, for example, in [5, 6]. Indeed, one has 
tf(n,k;a) = (-l) n - k C(n,k;a). 

Besides ^(n,k;a) we consider another quantity ^(n,k;a,j) which is 
known as noncentral generalized factorial coefficient. It is defined as the 
coefficient of the kih order factorial of t in the expansion of the nth order 
noncentral generalized factorial of t, with scale parameter a and noncentral- 
ity parameter 7, that is, 

n 

(38) (crt-j) n = Y,V(n,k;a, 1 )(t) k . 

k=0 

Note that in [5] the definition of noncentral generalized factorial coefficient 
designates a quantity C(n,k;a,j) = (— \) n ~ kc €(n, k; a, 7). From (2.60) in [5] 
it is seen that it can be represented as 

(39) V(n,k; a, 7) = - £(-1)* ( *) (-aj - 7 ) B 

K -j=o KJ/ 

and this can be usefully employed in order to evaluate the probability of 
discovering a new species. Moreover, from (2.56) in [5] it is possible to es- 
tablish a connection between noncentral and central generalized factorial 
coefficients 



(40) tf(n, k;a, 1 ) = J2[ k; <r)(-7)n— 

s=k ^ ' 

Finally we briefly recall the relation to Stirling numbers. Indeed, 

V n,k,a = \ s ( n fy\ 

a— >0 <J 

where, as before, \s(n, k)\ is the signless Stirling number of the first kind. 
Moreover, one has 

^{n,k;a,-f) " / n \ 

]™ — V k — = Mi J \ 8 (*M(-y)n-i- 

i=k 

A. 2. Multivariate Chu Vandermonde formula. Here we present a multi- 
variate version of the celebrated Chu- Vandermonde identity. In, for example, 
[5] the following version of the Chu- Vandermonde identity is presented: 

(41) [a + 6]» = E(")Wr[6]n-r 

r=0 ^ ' 
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for any a and b in R, where [x] n := x(x — 1) • • • (x — n + 1) stands for the 
descending factorial. Since a multivariate version in terms of rising factorials 
seems not readily available in the literature we present it together with a 
proof. 

Lemma A.l. For each q, j > 1, set Aj^ q = {(<?i, . . . ,qj) : q% > 0, £)j=i q% = 
q}. Then 

(42) E ( a , q . a )t[( a i)n i+qi -i=(n-j + i2^i] IlKk-i 

(gi,..., gj )€^, 9 Vyl yj/ i=l V i=l /gt=l 

where (rii, . . . , rij) is such that rii > 0, /or i = 1, . . . , j and J2i=i n i = n - 

Proof. Since r ^ n) = (a) n = (-l) n [-a]„, from identity (41) one de- 
duces that 

n / \ 

(43) (o + 6)„ = X;(")(o)r(6)n-r. 

r=0 ^ ' 

The proof now follows by inductive reasoning. Suppose the identity holds 
true for j — 1, that is, 

q\ j ~ 2 
^ H7\ a- n\n- 1 1 ( a J'-l)nj-i+«j-i-l II (^Ww" 1 

/ j-i \ i-i 

= n-(j-l) + E«i II^)^- 1 ' 

\ i=l /gi=l 

and we show it holds for j as well. Indeed, observe that 



! 

n,_i+(j,_i-l 



= £ - — f7T~; — v^'- 1 ^ 

?i _ 1=0 ^-l!W-?i-i)! 

E(9-9 3 -_i)! , . j?. . 

— j (aj)n j+9j -i J_J_(aij„. +? ._i. 

( g i,...,«,--i)eA,_ 1 „_ ej _ 1 ^ *- 2 -*- *=i 
By the induction hypothesis, the second factor above equals 

( j \ j ~ 2 

I n - Hj-i - (j - 1) + E a « - a j-l ] ( a j)rij-l JJ(ai)rH-l- 

V i=l / i=l 
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Finally the proof is completed by virtue of (43), after noting that 

(%--l)nj_i+g 3 --i-l = (ai-l)nj-i-l(^i-l ~ 1 + □ 

Lemma A.l can also be proved by combining the last relation displayed 
in the proof with the definition of the multinomial-Dirichlet distribution, ac- 
cording to which E(,i,... )3 ,-)eA J q i qi q .. q ) ni=i(°i)ni+«-i/(n ~J + H=i a i)g = 
1. 



A.3. Proofs. 



Proof of Proposition 1. An obvious point to start from is the fol- 
lowing: if we have seen n observations partitioned into j distinct groups, 
then the conditional probability that the next q > 1 observations provide no 
new groups is n?=i(l ~~ which, using the recursive formula 

(8) for V njk , is given by 

P(K^=0\K n = j,N n = n J ) 

1 T/ T/ 

n( i ; i • \ v n-+l,j I ■ \ v n+q,j 

1=1 V n+l-l,j v n,j 

On the other hand, suppose we have seen n + q observations yielding K n+q = 

j groups. Then the conditional probability of obtaining K^ l+q ^ = k new 
groups of sizes si, . . . , s& from the next s observations, where none of these 
coincides with the first n + q, is given by 



Vn+q,j i=1 

where si + • • • + = s. If we now set q + s = m, the conditional probabil- 
ity of obtaining new groups with respective frequencies s\, . . . ,Sk in the m 
observations following on from n, given K n = j, is, due to exchangeability, 
found by multiplying the two conditional probabilities above and including 
the (™) term. Hence one achieves (9). Note that an alternative proof can be 

given by considering the joint distribution of (i^ n ,N„, K$ ,L$ ,S („), A,) 
where A,- = (A, r („>,..., A. r r n )) is the vector of nonnegative integers 
denoting the number of new observations in each of the j groups into which 
the first n observations are partitioned, and then by using Lemma A.l. □ 

Proof of Proposition 2. The proof works by induction. Let us first 
note that for any m > 1 one has 

r (n) _ j (n) , tt 
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where H n , m = I X c(X n+m+1 ) and X n = {Xi, . . .,X n }. Let us first fix m = 1 
and determine 

E[4 l) \K n =j]=E[L i ( l) \K n =j]+E[H n:1 \K n = j}. 

The first summand is clearly equal to V n+ ij + i/V n j. As for the second sum- 
mand, one can use the assumption of exchangeability which yields 

E[H nil \K n = j] = E[I X c (X n+2 )\K n = j] 

= E[I X o(X n+1 )\K n =j) 

_ Kt+ij+i 

Hence, (16) holds true for m = 2. Now, suppose (16) is valid for m and let 
us show this implies it is still true for m + 1 . This means we shall determine 

E[L^l 1 \K n =j]=E[Lt ) \K n =j]+E[H n>m \K n =j]. 

By assumption, E[Lm \K n = j] = mV n+ ij + i/V rh j. Moreover, exchangeabil- 
ity again entails that the second summand above is V n+ ij + i/V n j and the 
conclusion follows. □ 

Proof of Proposition 3. This is straightforward and follows from 
taking the ratio between (9) and (11) in Corollary 1. □ 

Proof of Corollary 2. Let 

*n-\-m,j-\-k 



fj,a{s, k) 



which, by assumption, does not depend on n and m. Then, if s = 2 and k = 2 

K+m,(j-2)+2 _ , , . 

and, with s = 2 k = 1 

Vn+m,(j-2)+l , (0 

- Jj-2,o-( l 2, 1). 



If we, now, consider the ratio of these two expressions, we obtain the identity 

Vn+m,j _ /j-2,o-(2; 2) 
/i-2, CT (2,l)' 

From this, one sees that 

K+raj J~J ^n+m,i J~J /i,cr(2,2) 

Vn+m,2 i=3 Vn+m,i-l i=l /i,<r(2, 1) 
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so that V n+m .j = V n+mt2 Ui=i(,fi^( 2 ^ 2 ) / fiA 2 ' !)) = 9i{ n + m )92{j) for some 
functions g± and g 2 - By a result of Kerov ([18], Theorem 7.1) (see also [9]) 
this entails that the weights V n ^ are those from the two-parameter Poisson- 
Dirichlet process. □ 

Proof of Proposition 4. Consider the expression displayed in (28) 
and sum with respect to all vector s/% in A s ^ to obtain 

F(K n = j, N n = n, , L&) = s , k£> = k,A r = X r ) 

(44) 



17 i=l «=r+l 

Now exploit Lemma 1 in order to integrate out A r thus obtaining 
F(K n = j, N n = nj , L$ = s, = k, M r W ) 



(45) 



v n+md+ k — — k — \2^ n h- r < J ) IKi-^k-i- 



a" 



\l=l / m— s i=l 



Finally, integrating out Lm' and i^m an d summing over k = 0, . . . ,m one 
has 

P(iT n =j,N n = nj ,M r ( n )) 

(46) 

= II C 1 - E ^±J^ m, k; a,ra-Y: n H . 

i=l k=0 \ 1=1 ) 

Hence, the ratio of (46) over the EPPF IT^ (m, . . . , Uj) yields the result in 
(29). □ 
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